Welcome to Amethyst! The goal of Amethyst is to facilitate comprehensive analysis tools for single-cell methylation data. If you use this package or code on our Github for your work, please cite our manuscript.
If you are new to R, you may need to install some of the dependencies:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
library("BiocManager")
BiocManager::install(c("caret", "devtools", "data.table", "furrr", "future", "future.apply",
"ggplot2", "grDevices", "gridExtra", "igraph", "irlba", "janitor", "Matrix", "methods", "pheatmap",
"plotly", "plyr", "purrr", "randomForest", "rhdf5", "rtracklayer", "Rtsne", "scales", "stats", "stringr",
"tibble", "tidyr", "umap", "utils", "dplyr"))
Next, install a few additional dependencies found on Github, including amethyst itself.
devtools::install_github("JinmiaoChenLab/Rphenograph")
devtools::install_github("KrishnaswamyLab/MAGIC/Rmagic")
devtools::install_github("TomKellyGenetics/leiden")
devtools::install_github("lrylaarsdam/amethyst")
Now load libraries into R:
library(leiden)
library(amethyst)
library(data.table)
library(ggplot2)
library(tibble)
library(tidyr)
library(plyr)
library(future)
library(furrr)
library(purrr)
library(cowplot)
library(pheatmap)
library(dplyr)
First, download the practice data. This vignette comes with site-level CpG methylation information from 50 human banked PBMCs. Download this small dataset from AWS (Amazon Web Services).
Note: By default, data will download to the “~/Downloads/” folder. Change if a different directory is desired.
download.file("https://adeylabopen.s3.us-west-2.amazonaws.com/amethyst/pbmc_vignette.h5", "~/Downloads/pbmc_vignette.h5", method = "curl") # Contains site-level methylation information for each cell
The file we’ve just downloaded is in an hdf5 format. This data storage structure was chosen because it facilitates reading the data to memory in small chunks as needed, which is really useful for massive datasets like single-cell methylation. The diagram below shows the structure of datasets and groups within the h5 file:
We will also download some helpful metadata for each cell. The first file is a cellInfo.txt file, which is produced by the content_extract step of initial processing with Premethyst. The ScaleMethyl pipeline has an analagous file, allCells.csv, which is automatically loaded if you use the createScaleObject helper function (see Assembling the Amethyst object: ScaleMethyl-specific assembly instructions.)
download.file("https://raw.githubusercontent.com/lrylaarsdam/amethyst/main/vignettes/pbmc_vignette/pbmc_vignette_cellInfo.txt", "~/Downloads/pbmc_vignette_cellInfo.txt")
The second metadata file we will download is a .annot file, which is a manually assembled text file commonly used within the Adey lab. It contains the key for barcode sequences and associated samples. We are able to map this using knowledge of how the samples were distributed in the plate at the first barcoding step. This file is not essential to all workflows and is mostly a helper function for internal Adey lab use - the important thing is to get an idea for how we are assembling the metadata.
download.file("https://raw.githubusercontent.com/lrylaarsdam/amethyst/main/vignettes/pbmc_vignette/pbmc_vignette.annot", "~/Downloads/pbmc_vignette.annot") # Simulated batch metadata
Now construct an Amethyst object. In R, an object structure is a convenient method for storing all our project information in one place. We will add paths to the h5 file, metadata, aggregated methylation values, results, and more. Each type of data goes in specific “slots” so Amethyst knows where to retrieve it. Generate an empty object with the createObject command.
obj <- createObject()
If you used the Premethyst pipeline for initial processing, Amethyst has helper functions to add useful metadata about each cell from the intermediate output metric files. This includes quality control metrics contained in the .cellInfo.txt file or .annot files.
obj <- addCellInfo(obj, file = "~/Downloads/pbmc_vignette_cellInfo.txt")
obj <- addAnnot(obj, file = "~/Downloads/pbmc_vignette.annot", name = "batch")
Next, we need to specify the location of the h5 file containing site-level methylation data for each barcode. In this case, every barcode belongs to the same h5 file, but an unlimited number of h5 files can be used in the same object. It is very important that this slot correctly points each barcode to where the data is stored.
Note: If you are combining experiments with overlapping barcodes, please see our vignette for addressing this. The h5paths slot is set up differently. Also, please note we have changed the heading from ‘paths’ (v0.0.0.9000) to ‘path’ (v1.0+) to be more consistent, given the addition of the barcode and prefix columns.
obj@h5paths <- data.frame(barcode = rownames(obj@metadata), path = rep("~/Downloads/pbmc_vignette.h5", length(rownames(obj@metadata))))
head(obj@h5paths)
## barcode path
## 1 ACGCGACGGCACGAGAATCACTGTCATG ~/Downloads/pbmc_vignette.h5
## 2 ACGCGACGGCTATACCGAAGCGCCTATA ~/Downloads/pbmc_vignette.h5
## 3 ACTTCTGCCAACGCTTCGTCAGTACTCC ~/Downloads/pbmc_vignette.h5
## 4 ACTTCTGCCAACGCTTCGTCGATGATCC ~/Downloads/pbmc_vignette.h5
## 5 ACTTCTGCCAAGATAGAGGTCGTCTGCG ~/Downloads/pbmc_vignette.h5
## 6 ACTTCTGCCAATAATCGCGGCTGATGCT ~/Downloads/pbmc_vignette.h5
Amethyst has a helper loading function for output of the ScaleMethyl pipeline. This will generate an amethyst object, add metadata, and load the specified matrices.
Note: If you are combining experiments with overlapping barcodes, please see our vignette for addressing this. The h5paths slot is set up differently. Also, please note we have changed the heading from ‘paths’ (v0.0.0.9000) to ‘path’ (v1.0+) to be more consistent, given the addition of the barcode and prefix columns.
# This is just for real data illustration purposes - do not run if you are processing vignette data.
# obj <- createScaleObject(directory = "~/Downloads/ScaleMethyl.out/samples", genomeMatrices = list("CG.score", "CH"))
Any pre-processing platform can be used. The important thing is that the data is in the expected structure. The diagram below illustrates each Amethyst object slot and its respective data contents. For example, the metadata can be assembled in any manner, as long as the result is a data.frame with cell barcodes as row names.
It can be helpful to filter cells with outlying coverage values right away so downstream functions don’t unnecessarily perform calculations for cells that will not be used. First, view the coverage distribution - i.e., how many cytosine positions are captured per cell. We want to filter out cells that will have too sparse of information to properly cluster, or filter cells with outlying extremely high coverage values. With the commands below, we are simply plotting the distribution of coverage values in the metadata with ggplot, then filtering failed cells from the metadata with dplyr logic.
Note: Vignette data has been pre-filtered. We recommend cells have a minimum of 1M cytosines covered, if possible. Filter by any additional metrics in your metadata as needed.
ggplot(obj@metadata, aes(x = cov)) + geom_histogram(bins = 10) + theme_classic()
obj@metadata <- obj@metadata |> dplyr::filter(cov > 100000 & cov < 40000000)
The next step is to cluster cells, which we typically do based on methylation values over fixed genomic windows. Depending on the size and depth of your dataset, calculating methylation levels over features can be a computationally intensive step. There are three ways to do this (only do one):
Note: Clustering methods for single-cell methylation data, as a newer modality, will require more hands-on interpretation of results. Please thoroughly check that your results are not driven by technical issues (like coverage bias) and expect that this step will take some optimization.
The most straightforward method is to calculate methylation levels over genomic windows with Amethyst in R.
First, perform an initial indexing step, which determines the locations corresponding to each chromosome in every cell’s h5 file. This reduces the computational load by only calculating across one at a time. The output is a list of data frames. Each data frame corresponds to the where the base-level info starts and ends for each chromosome and cell.
Note: Many functions in Amethyst can be parallelized by increasing the threads parameter, but this is often not practical on a local computer. Do not increase the thread number for this vignette, but increase when performing initial computationally heavy steps on a server.
obj@index[["chr_cg"]] <- indexChr(obj, type = "CG", threads = 1)
head(obj@index[["chr_cg"]][["chr1"]])
## cell_id chr start count
## <char> <char> <int> <int>
## 1: ACGCGACGGCACGAGAATCACTGTCATG chr1 1 73977
## 2: ACGCGACGGCTATACCGAAGCGCCTATA chr1 1 84137
## 3: ACTTCTGCCAACGCTTCGTCAGTACTCC chr1 1 84501
## 4: ACTTCTGCCAACGCTTCGTCGATGATCC chr1 1 95146
## 5: ACTTCTGCCAAGATAGAGGTCGTCTGCG chr1 1 102023
## 6: ACTTCTGCCAATAATCGCGGCTGATGCT chr1 1 108912
Now that you have made the index, use makeWindows to calculate methylation levels over large genomic windows. The output is a data frame of windows (rows) x cells (columns). There are three options for the “metric” parameter: - “percent”, which is just percent methylated cytosines out of 100 - “ratio”, which is the methylation level of the window divided by the global average methylation level - “score”, which is a scaled measure of deviation from baseline - Formally, score is calculated as: (mCGfeature – mCGglobal)/(1 - mCGglobal) if mCGfeature – mCGglobal > 0, and (mCGfeature – mCGglobal)/(mCGglobal) if mCGfeature – mCGglobal < 0. The important thing to remember with “score” is -1 corresponds to an entirely unmethylated region, 0 is the same as average methylation, and 1 is fully methylated.
Note: If you are using non-conventional chromosomes, please
provide a whitelist (see function parameters).
Note: You may have to copy/paste any code reading the h5 file (e.g.,
makeWindows) directly into the console instead of running the chunk, if
using the .Rmd template. Note: An NA value indicates no
cytosines were captured for that given feature.
obj@index[["chr_cg"]] <- indexChr(obj, type = "CG", threads = 1)
obj@genomeMatrices[["cg_100k_score"]] <- makeWindows(obj,
stepsize = 100000,
type = "CG",
metric = "score",
threads = 1,
index = "chr_cg",
nmin = 2)
makeWindows can be memory-intensive. If you have very large datasets, it may be preferable to apply Facet, a Python-based helper script to compute windows in a more efficient manner. Results are stored in the h5 file and loaded as needed. For installation instructions and examples, see Facet documentation. The output in the example below would be the same h5 file with base observations stored under /context/barcode/1 and 100kb window results stored under /context/barcode/100000.
Note: Skip to the “clustering continued” section if you used makeWindows to calculate 100kb window methylation score.
# don't run; example of how to compute aggregated windows with facet
# facet agg -u 100000 pbmc_vignette.h5
If pre-computed windows are stored in the h5 file, load to the genomeMatrices slot with loadWindows. An indexing step is not needed for Facet. The result will be the same as makeWindows: a data frame of windows (rows) x cells (columns).
# don't run; example of how to load pre-computed aggregated windows with facet
# obj@genomeMatrices[["cg_100k_score_facet"]] <- loadWindows(obj, name = "100000", nmin = 2, metric = "score")
You may want to remove windows where most values are NA. In this case, since the vignette data is high coverage and the genomic windows are large, I am going to filter for at least 90% of the cells have values in that window. The appropriate threshold will highly depend on how big the windows are and how many cells you have.
Note: Please adjust this threshold in a dataset-specific manner. If you use short windows, a 90% threshold is way too high.
# Check that you won't be filtering too many rows first
nrow(obj@genomeMatrices[["cg_100k_score"]][rowSums(!is.na(obj@genomeMatrices[["cg_100k_score"]])) >= nrow(obj@metadata)*.9, ])
## [1] 28422
# Proceed if the number passing is reasonable (this is somewhat subjective based on your dataset and windowing strategy)
obj@genomeMatrices[["cg_100k_score"]] <- obj@genomeMatrices[["cg_100k_score"]][rowSums(!is.na(obj@genomeMatrices[["cg_100k_score"]])) >= nrow(obj@metadata)*.9, ]
Next, perform dimensionality reduction. If you are unsure how many
dimensions to use, the dimEstimate function can estimate the
number needed to explain the desired variance threshold.
Note: In this example, the number of requested output dimensions is
low because pbmc_vignette.h5 has 50 cells. Adjust according to your
data. Skip if it’s not helpful.
dimEstimate(obj, genomeMatrices = c("cg_100k_score"), dims = c(10), threshold = 0.95)
## cg_100k_score
## 7
As suggested, we will reduce the data from “cg_100k_score” into 7 dimensions using the irlba package, which performs fast truncated singular value decomposition. Feel free to implement alternative methods if desired. The output will be a data frame of cells (rows) by reduced dimensions (columns) with the corresponding component values.
set.seed(111)
# name the result whatever you want. I like descriptive names, at the cost of length.
obj@reductions[["irlba_cg_100k_score"]] <- runIrlba(obj, genomeMatrices = c("cg_100k_score"), dims = c(7), replaceNA = c(0))
Now determine cluster membership with runCluster. Either a Louvain-based or Leiden-based strategy can be used. Assign the output column name, which will be added to the metadata (or leave blank and it will be assigned “cluster_id”). This update was implemented so one can save multiple iterations within the same object.
Note: In this example, k is low because pbmc_vignette.h5 has 50 cells. Increase for larger datasets, depending on number of cells and expected heterogeneity. For example, we would typically use values of 30-100 for datasets of 1k-100k cells. A general rule of thumb is k = sqrt(n)/2. Note: I am setting the seed to try and make results reproducible. It is not essential, but good practice for your own data. Annoyingly, different R versions produce different random results with the same seed. I am using v4.3.0. Don’t worry if your projections look slightly different.
set.seed(111)
# run clustering with louvain-based method
obj <- runCluster(obj, k = 10, reduction = "irlba_cg_100k_score", method = "louvain", colname = "louvain_cluster_100k")
## Run Rphenograph starts:
## -Input data of 50 rows and 7 columns
## -k is set to 10
## Finding nearest neighbors...DONE ~ 0.001 s
## Compute jaccard coefficient between nearest-neighbor sets...DONE ~ 0.002 s
## Build undirected graph from the weighted links...DONE ~ 0.01 s
## Run louvain clustering on the graph ...DONE ~ 0.003 s
## Run Rphenograph DONE, totally takes 0.0160000000000053s.
## Return a community class
## -Modularity value: 0.5347043
## -Number of clusters: 4
# example: run leiden-based algorithm and store within the same object
obj <- runCluster(obj, k = 10, reduction = "irlba_cg_100k_score", method = "leiden", colname = "leiden_cluster_100k")
# look at how results are stored
head(obj@metadata)
## cov cg_cov mcg_pct ch_cov mch_pct batch
## ACGCGACGGCACGAGAATCACTGTCATG 27258073 923300 82.21 21341679 0.54 2
## ACGCGACGGCTATACCGAAGCGCCTATA 29188802 918053 81.16 22528736 0.53 2
## ACTTCTGCCAACGCTTCGTCAGTACTCC 25963119 1039143 71.63 21429728 0.45 2
## ACTTCTGCCAACGCTTCGTCGATGATCC 29345922 1144694 74.76 24104451 0.45 1
## ACTTCTGCCAAGATAGAGGTCGTCTGCG 27393943 1249755 68.49 23454887 0.54 1
## ACTTCTGCCAATAATCGCGGCTGATGCT 28318992 1264972 69.86 23883025 0.46 1
## louvain_cluster_100k leiden_cluster_100k
## ACGCGACGGCACGAGAATCACTGTCATG 1 3
## ACGCGACGGCTATACCGAAGCGCCTATA 1 3
## ACTTCTGCCAACGCTTCGTCAGTACTCC 2 4
## ACTTCTGCCAACGCTTCGTCGATGATCC 1 3
## ACTTCTGCCAAGATAGAGGTCGTCTGCG 2 4
## ACTTCTGCCAATAATCGCGGCTGATGCT 2 4
Next, project to 2D with runUmap and/or runTsne. Assign the output to a reductions slot. Like the runCluster update, this allows multiple projections to be stored within the same object. The results are 2D UMAP or TSNE coordinates with column names “dim_x” and “dim_y”. Future plotting functions like dimFeature will depend on this structure.
Note: Neighbors/perplexity values are small because this dataset is 50 cells. Increase for larger datasets, depending on number of cells and expected heterogeneity. A good starting point is k = sqrt(n)/2.
set.seed(111)
# name the result whatever you want. I like descriptive names, at the cost of length.
obj@reductions[["umap_irlba_cg_100k_score"]] <- runUmap(obj, neighbors = 5, dist = 0.05,
method = "euclidean", reduction = "irlba_cg_100k_score")
obj@reductions[["tsne_irlba_cg_100k_score"]] <- runTsne(obj, perplexity = 5, method = "euclidean",
theta = 0.2, reduction = "irlba_cg_100k_score")
There are many methods out there for clustering. One may wish to implement an alternative strategy, like clustering with MethSCAn variably methylated regions. This worked quite well for our small, high-coverage dataset. Results from other methods can easily be dropped in to an Amethyst object and analysis can proceed as normal. Please see our vignette for example alternative clustering methods. In the end, all you have to do is assign the output to a new reductions result as a data frame of x and y coordinates:
# theoretical example of 2D umap projection from methscan vmrs - do not run
# obj@reductions[["umap_methscan_vmrs"]] <- methscan_result
# colnames(obj@reductions[["umap_methscan_vmrs"]]) <- c("dim_x", "dim_y")
First, plot the dimensionality reduction results with dimFeature. Each point is a cell. x and y coordinates are derived from the UMAP or TSNE results we just calculated above. Points can then be colored by any variable in the metadata, allowing assessment of how different parameters are driving results. For example, use colorBy = louvain_cluster_100k to show how the different group assignments are distributed.
p1 <- dimFeature(obj, colorBy = louvain_cluster_100k, reduction = "umap_irlba_cg_100k_score", pointSize = .8) + ggtitle("UMAP CG 100k score")
p2 <- dimFeature(obj, colorBy = louvain_cluster_100k, reduction = "tsne_irlba_cg_100k_score", pointSize = .8) + ggtitle("TSNE CG 100k score")
plot_grid(p1, p2)
Here, you can also see if technical metrics like number of cytosines covered are driving your data. Below we are using dimFeature to plot the dimensionality reduction results and will color the cells according to the “cov” column values in obj@metadata.
dimFeature(obj, colorBy = cov, reduction = "umap_irlba_cg_100k_score", pointSize = .8) + ggtitle("Coverage distribution") + scale_color_gradientn(colors = c("black", "turquoise", "gold", "red"))
This dataset is very high-coverage, so we don’t see any bias, but often there will be cells that group together because they have very low or high coverage values. This is an unwanted technical artifact. We have developed the regressCovBias function to try and alleviate this. Below is an example of how to apply regressCovBias with a generalized additive model (“gam”). We recommend not including unless you observe bias in your data.
# Optional; helps reduce coverage bias in clustering. Run both ways and see if results are improved.
obj@reductions[["irlba_cg_100k_score_gam"]] <- regressCovBias(obj, reduction = "irlba_cg_100k_score", method = "gam")
# then re-run clustering and umap with the new dimensionality reduction output
obj@reductions[["umap_irlba_cg_100k_score_gam"]] <- runUmap(obj, neighbors = 5, dist = 0.05, method = "euclidean", reduction = "irlba_cg_100k_score_gam")
You might find that fixed genomic windows don’t give you good resolution of groups. Any feature set can be used for dimensionality reduction input. The makeWindows function can also calculate methylation levels over a bed file or genes (Note: calculation of methylation over genes is most applicable when looking at mCH in the brain.) Here is another clustering example using a set of pre-identified PBMC variably methylated regions (VMRs). These VMRs are compiled from analysis of the full version of this PBMC dataset.
First, we will download the bed file from Github, then feed it to makeWindows to calculate %CG over these bed file regions. The output is a data frame of windows (rows) x cells (columns) with corresponding %mCG for each region in the bed file. We will also remove windows with a very low number of observations across cells.
download.file("https://raw.githubusercontent.com/lrylaarsdam/amethyst/main/vignettes/pbmc_vignette/pbmc_highconfidence_dmrs.bed", "~/Downloads/pbmc_vmr.bed")
obj@genomeMatrices[["pbmc_vmrs"]] <- makeWindows(obj, bed = "~/Downloads/pbmc_vmr.bed", type = "CG", metric = "percent", threads = 1, index = "chr_cg", nmin = 2)
obj@genomeMatrices[["pbmc_vmrs"]] <- obj@genomeMatrices[["pbmc_vmrs"]][rowSums(!is.na(obj@genomeMatrices[["pbmc_vmrs"]])) >= nrow(obj@metadata)*.2, ]
Now re-run irlba with the VMR-based windows. We will assign the dimensionality reduction results to a new data frame within the reductions slot. That way, you can return to the previous reduction results if needed.
dimEstimate(obj, genomeMatrices = c("pbmc_vmrs"), dims = c(10), threshold = 0.90)
## pbmc_vmrs
## 8
set.seed(111)
obj@reductions[["irlba_cg_vmr_pct"]] <- runIrlba(obj, genomeMatrices = c("pbmc_vmrs"), dims = c(8), replaceNA = c(0))
Now re-run clustering and UMAP with the new VMR-based windows. We will use the plotting function dimFeature on the result to see if the clustering is tighter and the groups are more distinct from each other.
Note: I am setting the seed to try and make results reproducible. It is not essential, but good practice for your own data. Annoyingly, different R versions produce different random results with the same seed. I am using v4.3.0. Don’t worry if your projections look slightly different.
set.seed(111)
obj <- runCluster(obj, k = 10, reduction = "irlba_cg_vmr_pct", method = "louvain", colname = "louvain_vmr_cluster") # consider increasing k_phenograph to 50 for larger datasets
## Run Rphenograph starts:
## -Input data of 50 rows and 8 columns
## -k is set to 10
## Finding nearest neighbors...DONE ~ 0.001 s
## Compute jaccard coefficient between nearest-neighbor sets...DONE ~ 0.001 s
## Build undirected graph from the weighted links...DONE ~ 0.003 s
## Run louvain clustering on the graph ...DONE ~ 0 s
## Run Rphenograph DONE, totally takes 0.00500000000000256s.
## Return a community class
## -Modularity value: 0.5383537
## -Number of clusters: 4
set.seed(123) # the changing of seed number here is not important
obj@reductions[["umap_irlba_cg_vmr_pct"]] <- runUmap(obj, neighbors = 5, dist = 0.05, method = "euclidean", reduction = "irlba_cg_vmr_pct")
dimFeature(obj, colorBy = louvain_vmr_cluster, reduction = "umap_irlba_cg_vmr_pct", pointSize = .8)
In this context, VMRs worked much better as features than fixed genomic windows for clustering, so we will continue using results from this dimensionality reduction.
A word of caution: More clusters and higher separation does not always mean the result is better. Some approaches like iterative latent semantic indexing could exacerbate subtle differences that are not really biologically meaningful, or sometimes technical issues can result in high separation of groups where there shouldn’t be. In this context, we know PBMCs should have multiple distinct subtypes. Visualization functions like dimFeature show even distribution of technical variables like coverage, suggesting that cluster separation is biologically driven. Make sure to explore the distribution of any metadata you have.
dimFeature and many other visualization functions use ggplot logic, so you can easily modify plots as needed. For example, here we will facet the dimFeature results by the “batch” variable in obj@metadata.
dimFeature(obj, colorBy = louvain_vmr_cluster, reduction = "umap_irlba_cg_vmr_pct", pointSize = .8) + facet_wrap(vars(batch))
It is often helpful to see the distribution of metadata variables between samples. The command below will plot the cluster proportion distribution for each batch in a bar chart. The width of each color will correspond to the proportion of grouping metadata devoted to each cluster. In this context, we want to check that the proportions are relatively even between clusters to make sure that there is not drastic variation between batches. Again, plots can be easily modified with ggplot command logic.
Note: Since there are so few cells, the variation in proportion between batches is not concerning.
sampleComp(obj, groupBy = "batch", colorBy = "louvain_vmr_cluster")
If you want to make your plots look nicer, Amethyst provides many built-in color palettes. The function testPalette will show you what each option will look like with the required number of colors (one for each cluster, in this case).
testPalette(output = "swatch", n = length(unique(obj@metadata$louvain_vmr_cluster)))
If there is one you like, it is helpful to store the vector as a variable like “pal” for future use. Below we are illustrating how to store the pallete and recall it for use with dimFeature.
pal <- c("#F9AB60", "#E7576E", "#630661", "#B5DCA5") # makePalette(option = 7, n = 4)
dimFeature(obj, colorBy = louvain_vmr_cluster, colors = pal, pointSize = .8, reduction = "umap_irlba_cg_vmr_pct")
Now that we have clusters, the next step is determining their biological identity. This can be quite challenging for a novel modality like methylation. There are a couple ways we recommend going about this:
If you have an in-depth knowledge of the system, one useful method is to look at mCG hypomethylation over canonical marker genes. The first step is to load an annotation file for the reference genome so Amethyst knows the coordinates for each gene. The helper function makeRef loads this annotation file automatically from Gencode and puts it in the appropriate format for you. Use head to look at the result.
obj@ref <- makeRef("hg38")
head(obj@ref)
## seqid source type start end score strand phase gene_id
## <fctr> <fctr> <fctr> <int> <int> <num> <char> <int> <char>
## 1: chr1 HAVANA gene 11869 14409 NA + NA ENSG00000290825.1
## 2: chr1 HAVANA transcript 11869 14409 NA + NA ENSG00000290825.1
## 3: chr1 HAVANA exon 11869 12227 NA + NA ENSG00000290825.1
## 4: chr1 HAVANA exon 12613 12721 NA + NA ENSG00000290825.1
## 5: chr1 HAVANA exon 13221 14409 NA + NA ENSG00000290825.1
## 6: chr1 HAVANA gene 12010 13670 NA + NA ENSG00000223972.6
## gene_type gene_name level tag
## <char> <char> <char> <char>
## 1: lncRNA DDX11L2 2 overlaps_pseudogene
## 2: lncRNA DDX11L2 2 Ensembl_canonical
## 3: lncRNA DDX11L2 2 Ensembl_canonical
## 4: lncRNA DDX11L2 2 Ensembl_canonical
## 5: lncRNA DDX11L2 2 Ensembl_canonical
## 6: transcribed_unprocessed_pseudogene DDX11L1 2 <NA>
## transcript_id transcript_type transcript_name transcript_support_level
## <char> <char> <char> <char>
## 1: <NA> <NA> <NA> <NA>
## 2: ENST00000456328.2 lncRNA DDX11L2-202 1
## 3: ENST00000456328.2 lncRNA DDX11L2-202 1
## 4: ENST00000456328.2 lncRNA DDX11L2-202 1
## 5: ENST00000456328.2 lncRNA DDX11L2-202 1
## 6: <NA> <NA> <NA> <NA>
## havana_transcript exon_number exon_id hgnc_id
## <char> <char> <char> <char>
## 1: <NA> <NA> <NA> <NA>
## 2: OTTHUMT00000362751.1 <NA> <NA> <NA>
## 3: OTTHUMT00000362751.1 1 ENSE00002234944.1 <NA>
## 4: OTTHUMT00000362751.1 2 ENSE00003582793.1 <NA>
## 5: OTTHUMT00000362751.1 3 ENSE00002312635.1 <NA>
## 6: <NA> <NA> <NA> HGNC:37102
## havana_gene ont protein_id ccdsid artif_dupl location
## <char> <char> <char> <char> <char> <char>
## 1: <NA> <NA> <NA> <NA> <NA> chr1_11869_14409
## 2: <NA> <NA> <NA> <NA> <NA> chr1_11869_14409
## 3: <NA> <NA> <NA> <NA> <NA> chr1_11869_12227
## 4: <NA> <NA> <NA> <NA> <NA> chr1_12613_12721
## 5: <NA> <NA> <NA> <NA> <NA> chr1_13221_14409
## 6: OTTHUMG00000000961.2 <NA> <NA> <NA> <NA> chr1_12010_13670
Next, calculate methylation levels in short genomic windows for each cluster.
Deeper explanation: While calcSmoothedWindows sounds similar to makeWindows, they serve very different purposes. makeWindows returns a window (rows) x cell (columns) data frame for aggregated methylation values over very large regions. Here, we want to see methylation levels at a very fine resolution, which requires calculation over short regions. It is not feasible to make a window x cell data frame this big because there would be millions of windows per cell, threatening R’s internal vector limits. So, we pseudobulk by a metadata variable like cluster. We also smooth by averaging with both adjacent windows to try and alleviate some of the sparsity challenges that occur with methylation data. The output is a list of two data tables: “pct_matrix” with the pseudobulked, smoothed, %m over short windows; and “sum_matrix” with the sum of c (methylated) and t (unmethylated) counts use to produce “pct_matrix”. Though we won’t use the “sum_matrix” immediately, keep it, as we will need it for differentially methylated region testing.
We recommend 500bp windows for real data, but 1000 are
used here since the vignette dataset is only 50
cells.
Note: To avoid the hassle of recomputing these every time clusters
are adjusted, one can also calculate sliding smoothed windows with Facet
for each cell and store in the h5 file, then re-calculate/load more
rapidly per group with loadSmoothedWindows. See ?loadSmoothedWindows for
parameter specifications.
cluster1kbwindows <- calcSmoothedWindows(obj,
type = "CG",
threads = 1,
step = 1000, # change to 500 for real data unless you have really low coverage
smooth = 3,
genome = "hg38",
index = "chr_cg",
groupBy = "louvain_vmr_cluster",
returnSumMatrix = TRUE, # save sum matrix for DMR analysis
returnPctMatrix = TRUE)
Add the percent matrix output to the tracks slot. This will be used for visualizing methylation patterns over genomic regions with histograM or heatMap functions.
obj@tracks[["cg_cluster_tracks"]] <- cluster1kbwindows[["pct_matrix"]]
Now you can view methylation patterns over key marker genes with the heatMap function. This is a very versatile function and one of the best ways to view methylation tracks in Amethyst, so we recommend becoming acquainted with the parameters. Shown below are mean smoothed methylation levels per short genomic window with hypomethylated regions in blue and hypermethylated regions in light grey. Rows are pseudobulked values for each distinct population in the groupBy parameter. Below each plot, the gene body is shown, with orientation indicated by the arrow, exons in black rectangles, and the putative promoter (start +/- 1500 bp) in pink.
Note: By default, the arrow hangs over the gene by 3k bp so you can see it after the last exon. This can be adjusted with the arrowOverhang parameter.
heatMap(obj,
genes = c("CD2", "CD3E", "CD3D", "KLRB1", "CD22", "CD24", "CD79A", "CD14", "MPO"),
track = "cg_cluster_tracks",
nrow = 3,
legend = T)
The heatMap results suggest that group 1 are T cells (CD2+, CD3E+, CD3D+); group 2 are NK cells (KLRB1+); group 3 are B cells (CD22+, CD24+, CD79A+); and group 4 are monocytes (CD14+, MPO+).
Note: Not all genes will have distinct hypomethylation patterns over the expressing group. Many will have similar hypomethylation patterns despite established variation in transcription. Interpret with caution.
If you want to simultaneously view other information - like regulatory elements - these can be plotted as “tracks” below. In this example, we will use Encode candidate cis-regulatory elements (cCREs). The track structure should be a named list of data tables with chr, start, and end columns. First, download and split tracks into the expected structure.
# note this is for hg38 build only
library(rtracklayer)
download.file("https://hgdownload.soe.ucsc.edu/gbdb/hg38/encode3/ccre/encodeCcreCombined.bb", "~/Downloads/encodeCcreCombined.bb")
ccre <- as.data.table(rtracklayer::import("~/Downloads/encodeCcreCombined.bb"))
setnames(ccre, "seqnames", "chr")
ccre_tracks <- split(ccre, ccre$ucscLabel)
ccre_tracks <- lapply(ccre_tracks, function(x) {x[,.(chr, start, end)]})
Now plot tracks under the heatMaps by providing the list to the extraTracks parameter of heatMap:
heatMap(obj,
genes = c("CD2", "IRF8", "CD74"),
track = "cg_cluster_tracks",
nrow = 1,
legend = T,
extraTracks = ccre_tracks)
As you can see from the heatMaps, promoters are sometimes universally hypomethylated, or transposed from the predicted site. This helps illustrate the complexity of methylation patterns over gene bodies. Because of this, we prefer these tools to visualize over the whole gene body and surrounding context as opposed to aggregated metrics.
If you want to view other overlapping genes, use the regions parameter of heatMap instead of genes. This is very useful for looking at specific locations, like differentially methylated regions. In this example below, we will look at the region overlapping the gene S100A8. The helper function getGeneCoords extracts its genomic location from the reference annotation file for us.
heatMap(obj,
regions = getGeneCoords(obj@ref, gene = "S100A8"), # getGeneCoords is a shortcut to produce "chr1_153390032_153391073"
trackOverhang = 10000,
remove = "ENS", # optional; uses grep to remove genes that may be outside the question of interest
track = "cg_cluster_tracks",
extraTracks = ccre_tracks)
In addition to heatMap, you can also view methylation levels with the histograM function. This function has a similar effect, but methylation percentages are indicated as bar heights instead of colored tiles. Sometimes this is more helpful for illustrating patterns. Please see function documentation (with ?histograM) for a full list of which parameters can be changed.
histograM(obj,
baseline = "mean", # can plot from either mean or 0; see documentation
orientation = "cols",
genes = "CD3D",
track = "cg_cluster_tracks",
legend = F,
extraTracks = ccre_tracks)
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.
Despite the variability of methylation patterns over gene bodies, it can still be useful to look at aggregated metrics, especially when analyzing brain data. In this example we will use makeWindows to calculate %mCG for predicted promoter regions at protein coding genes. As previously mentioned, this is fraught with caveats, but we will use here for illustration purposes as it does work well for certain genes. We will also filter the resulting genes (rows) x cells (columns) matrix for genes with values in at least 10% of cells, filtering out a lot of very short genes that will not contribute meaningfully to this analysis.
protein_coding <- unique(obj@ref |> dplyr::filter(type == "gene" & gene_type == "protein_coding" & seqid != "chrM") |> dplyr::pull(gene_name))
obj@genomeMatrices[["cg_promoters"]] <- makeWindows(obj,
genes = protein_coding,
promoter = TRUE,
type = "CG",
metric = "percent",
threads = 1,
index = "chr_cg",
nmin = 5)
# subsetting to genes with values in at least 5 cells (10%) to reduce matrix size
obj@genomeMatrices[["cg_promoters"]] <- obj@genomeMatrices[["cg_promoters"]][rowSums(!is.na(obj@genomeMatrices[["cg_promoters"]])) >= nrow(obj@metadata) * 0.10, ]
Now you can view average %mCG of marker gene promoters by cluster with functions meant for visualizing aggregate values, like dotM. This plotting tool accepts two modalities - one will determine dot size and the other color. We will first calculate z score for color using the scale function. Then we will use dotM to plot genes on the x axis and groups on the y axis, with dot size according to mean promoter %mCG and color representing mean Z score for each gene x group.
obj@genomeMatrices[["cg_promoters_z"]] <- as.data.frame(scale(obj@genomeMatrices[["cg_promoters"]], center = TRUE, scale = TRUE))
genes <- c("SPI1", "CD2", "S100A8", "CD79A", "ELANE", "MPO", "MPEG1", "IRF8", "CD74", "CD3D")
dotM(obj, genes = genes, groupBy = "louvain_vmr_cluster", sizeMatrix = "cg_promoters", colorMatrix = "cg_promoters_z")
Again, functions like dotM can be easily modified with ggplot logic as desired. Below, we will plot the same thing as above, but this time apply ggplot parameters to facet by batch, adjust the color range, and adjust the size scale.
Note: In this small dataset, lack of dot means no value was collected.
dotM(obj, genes = genes, groupBy = "louvain_vmr_cluster", sizeMatrix = "cg_promoters", colorMatrix = "cg_promoters_z", splitBy = "batch") +
scale_color_gradientn(colors = c("#005eff", "grey60", "#ffa600")) + scale_size(range = c(1, 8))
It can also be helpful to use a less directed approach when determining differences between groups. There are functions to help with this like findClusterMarkers, which takes an input matrix and metadata grouping variable, then tests if any features are differentially methylated in that grouping variable relative to others. Below is an example findClusterMarkers call. For this vignette, we are just testing a subset of known marker genes, but for thorough data analysis it would better to test all protein coding genes.
cluster_promoter_markers <- findClusterMarkers(obj,
nmin = 5, # minimum number of cells with values over this gene in either the testing or comparison groups; increase to at least 10 for larger datasets
matrix = "cg_promoters",
genes = genes, # short subset for illustration purposes
groupBy = "louvain_vmr_cluster",
method = "BH", # recommended to use "bonferroni" for real data
threads = 1)
cluster_promoter_markers <- cluster_promoter_markers |> dplyr::filter(p.adj < 0.1) # Hardly any results because of dataset size; lower p.adj threshold for larger datsets
When plotting the result with dotM, we can indeed see that the promoter of T cell marker gene CD3D is hypomethylated in group 1, as identified by findClusterMarkers. Please note that in this example there are hardly any results because of our very small dataset size.
dotM(obj, genes = cluster_promoter_markers$gene, groupBy = "louvain_vmr_cluster", colorMatrix = "cg_promoters", sizeMatrix = "cg_promoters_z") +
scale_color_gradientn(colors = c("#005eff", "grey60", "#ffa600")) + scale_size(range = c(1, 12))
Another method one could use to aid in cell type determination is by comparison to an annotated reference. While few exist, we have put aggregated methylation levels per group over high-confidence PBMC VMRs calculated from the extended version of this dataset on Github. First, download this data, then calculate average methylation levels per cluster for each DMR (windows or any feature can also work) with aggregateMatrix. The result will be a window (rows) x groups (columns) matrix with mean values calculated for each feature by the grouping variable. The nmin parameter controls what the minimum number of cells are to be considered a distinct group. This reduces outliers that might arise from groups with very few cells.
download.file("https://raw.githubusercontent.com/lrylaarsdam/amethyst/main/vignettes/pbmc_vignette/pbmc_ref.RData", "~/Downloads/pbmc_ref.RData")
ref <- readRDS("~/Downloads/pbmc_ref.RData")
obj@genomeMatrices[["pbmc_vmrs_aggregated"]] <- aggregateMatrix(obj, matrix = "pbmc_vmrs", groupBy = "louvain_vmr_cluster", nmin = 2) # raise nmin for real data
Now view how well each of our clusters correlates with the pre-annotated data. First, merge together common windows from the pre-annotated data and our clusters. Then use the cor function to calculate Pearson’s r for pairwise complete observations. Finally, use pheatmap to plot comparisons across datasets.
cor <- cor(merge(ref,
obj@genomeMatrices[["pbmc_vmrs_aggregated"]],
by = 0) |> tibble::column_to_rownames(var = "Row.names"), use = "pairwise.complete.obs")
cor <- cor[c(1:ncol(ref)), c((ncol(ref) + 1)):ncol(cor)]
pheatmap(cor)
The correlation results agree with marker genes - group 1 is T cells, 2 is NK cells, 3 is B cells, and 4 are monocytes. Based on all these annotation tools, we can rename our clusters according to broad class using dplyr logic.
obj@metadata[["type"]] <- dplyr::recode(obj@metadata[["louvain_vmr_cluster"]],
"1" = "T",
"2" = "NK",
"3" = "B",
"4" = "Mono")
group_labels <- merge(obj@metadata, obj@reductions[["umap_irlba_cg_vmr_pct"]], by = 0) |> dplyr::group_by(type) |> dplyr::summarise(dim_x = mean(dim_x), dim_y = mean(dim_y))
library(ggrepel)
dimFeature(obj, colorBy = type, colors = pal, pointSize = 0.8, reduction = "umap_irlba_cg_vmr_pct") +
geom_text_repel(data = group_labels, aes(x = dim_x, y = dim_y, label = type))
You might also want to rename cluster tracks with the group annotations. Recalculate, or just copy the previous one and name like so:
obj@tracks[["cg_type_tracks"]] <- data.table::copy(obj@tracks[["cg_cluster_tracks"]])
data.table::setnames(obj@tracks[["cg_type_tracks"]], c("chr", "start", "end", "T", "NK", "B", "Mono"))
Now, we call the renamed track, heatMaps will be labeled with the corresponding population:
heatMap(obj,
genes = c("CD3D", "KIR2DL4", "MPO", "MPEG1"),
track = "cg_type_tracks",
nrow = 2,
legend = T)
At the core of single-cell methylation analysis is the identification of differentially methylated regions (DMRs). There are two main formats to set up DMR analysis. The first is to test DMRs for each cluster/type against all others. Only the sum matrix (which we saved at the calcSmoothedWindows step) is needed for the testDMR function. Alternatively, you could regenerate the calcSmoothedWindows output with your annotated cell types.
Note: A bug was recently found in testDMR. We recommend re-running testDMR for any analysis done with versions before v1.0.2. We sincerely apologize for the inconvenience.
dmrs <- testDMR(cluster1kbwindows[["sum_matrix"]], # Sum of c and t observations in each genomic window per group
eachVsAll = TRUE, # If TRUE, each group found in the sumMatrix will be tested against all others
nminTotal = 5, # Min number observations across all groups to include the region in calculations
nminGroup = 5) # Min number observations across either members or nonmembers to include the region in calculations
## Finished group 1
## Finished group 2
## Finished group 3
## Finished group 4
The output is the result of all tests. For each window and group, “c” is the total count of methylated observations, and “t” is the total count of unmethylated observations. A column for each test has been added with the resulting logFC and p value. However, this is not in a very user-friendly format. The output of testDMR is not really meant to be used as-is - it’s just an intermediate so one does not have to re-compute all tests if an alteration in stringency parameters is needed.
Next, expand and filter the resulting list according to the desired stringency.
dmrs <- filterDMR(dmrs,
method = "bonferroni", # c("holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr")
filter = TRUE, # If TRUE, removes insignificant results
pThreshold = 0.01, # Maxmimum adjusted p value to allow if filter = TRUE
logThreshold = 1.5) # Minimum absolute value of the log2FC to allow if filter = TRUE
head(dmrs)
## chr start end pval logFC padj direction test
## <char> <num> <num> <num> <num> <num> <char> <char>
## 1: chr1 868000 869000 3.782380e-16 3.2990 2.949170e-09 hyper 1_all
## 2: chr1 906000 907000 1.489341e-15 3.3188 1.161258e-08 hyper 1_all
## 3: chr1 1347000 1348000 7.903053e-19 2.8856 6.162111e-12 hyper 1_all
## 4: chr1 1426000 1427000 6.655007e-11 -1.9700 5.188994e-04 hypo 1_all
## 5: chr1 1512000 1513000 2.128572e-11 4.2218 1.659675e-04 hyper 1_all
## 6: chr1 1513000 1514000 3.860030e-12 2.0069 3.009715e-05 hyper 1_all
Often, adjacent genomic windows are differentially methylated. It can be helpful to collapse them into one locus with collapseDMR. As a helpful bonus, if annotation = T, any overlapping genes will be noted in the output results table.
Note: Please pay attention to the maxDist and minLength parameters and decide if these are appropriate for your data. Here, we are combining any DMRs within 2kb of each other, and applying a minimum threshold of 2kb to count as a DMR.
collapsed_dmrs <- collapseDMR(obj,
dmrs,
maxDist = 2000, # Max allowable overlap between DMRs to be considered adjacent
minLength = 2000, # Min length of collapsed DMR window to include in the output
reduce = T, # Reduce results to unique observations (recommended)
annotate = T) # Add column with overlapping gene names
head(collapsed_dmrs)
## Key: <chr, dmr_start, dmr_end>
## chr dmr_start dmr_end test direction dmr_length dmr_padj dmr_logFC
## <char> <num> <num> <char> <char> <num> <num> <num>
## 1: chr1 935000 940000 4_all hyper 5000 1.446852e-03 1.902650
## 2: chr1 1198000 1200000 3_all hypo 2000 1.460643e-05 -1.965950
## 3: chr1 1329000 1331000 3_all hypo 2000 1.713544e-03 -Inf
## 4: chr1 1512000 1514000 1_all hyper 2000 9.803231e-05 3.114350
## 5: chr1 1736000 1739000 3_all hypo 3000 6.779752e-10 -2.741867
## 6: chr1 2229000 2233000 1_all hypo 4000 6.404031e-10 -2.612975
## gene_names
## <char>
## 1: SAMD11
## 2: NA
## 3: NA
## 4: ATAD3A
## 5: ENSG00000268575, ENSG00000227775, ENSG00000290854, SLC35E2A
## 6: SKI
In the output, the “test” column indicates which comparison the DMR was identified in. For convenience, we will add an annotation column “type” to indicate which population the DMR corresponds to.
collapsed_dmrs$type <- dplyr::recode(collapsed_dmrs$test,
"1_all" = "T",
"2_all" = "NK",
"3_all" = "B",
"4_all" = "Mono")
Often, specific comparisons are desired, like testing between control and treatment groups of a specific cell type. A data frame can be provided describing the tests. Three columns should be included: One listing members of group A, one listing members of group B, and one with the name of the test.
Note: The “name”, “A”, and “B” columns are fixed; do not change or else testDMR won’t be able to extract the appropriate comparisons. Please also note that conditions for each test are concatenated with a single comma (no spaces in between) and wrapped in double quotes.
# don't run - this is just an example
comparisons <- data.frame(
stringsAsFactors = FALSE,
name = c("test1", "test2", "test3"),
A = c("1,2,3", "1", "2,3"),
B = c("1,4", "2", "1")
)
example_dmrs <- testDMR(sumMatrix = cluster1kbwindows[["sum_matrix"]], comparisons = comparisons, nminTotal = 5, nminGroup = 5)
If desired, any results can be stored in the results slot of the Amethyst object. This helps organize all analysis into one structure.
obj@results[["cg_collapsed_dmrs"]] <- collapsed_dmrs
After DMR testing, you should have a table of hypo and hypermethylated regions for each cluster. First, let’s look at how many DMRs were identified in each group with a simple ggplot command. We will plot a bar chart showing the number of DMRs identfied for each condition.
ggplot(collapsed_dmrs |> dplyr::group_by(type, direction) |> dplyr::summarise(n = n()),
aes(y = type, x = n, fill = type)) + geom_col() +
facet_grid(vars(direction), scales = "free_y") + scale_fill_manual(values = pal) + theme_classic()
It can be helpful to isolate top results per condition. I find it helpful to select by a combined metric of logFC and padj, because selecting by either metric alone tends to yield unwanted noise.
top_dmrs <- collapsed_dmrs |>
dplyr::group_by(type, direction) |>
dplyr::arrange(dmr_padj, .by_group = TRUE) |> dplyr::mutate(rank_padj = 1:n()) |>
dplyr::arrange(desc(abs(dmr_logFC)), .by_group = TRUE) |> dplyr::mutate(rank_logFC = 1:n()) |>
rowwise() |> dplyr::mutate(total_rank = sum(rank_padj, rank_logFC)) |>
group_by(type, direction) |> slice_min(n = 1, order_by = total_rank) |>
dplyr::mutate(location = paste0(chr, "_", (dmr_start - 2000), "_", (dmr_end + 2000))) |> dplyr::arrange(direction)
Next, use heatMap to plot top hypomethylated regions for each group. Some top hits show are over genes with known important biology relevant for the group - for example, TRAJ genes for T cells. This supports the validity of the DMR identification approach.
heatMap(obj,
track = "cg_type_tracks",
regions = top_dmrs$location[top_dmrs$direction == "hypo"],
nrow = 2,
legend = F)
You can also turn the DMRs into tracks to plot the exact locations identified below the heatMaps.
dmr_tracks <- split(collapsed_dmrs |> dplyr::filter(direction == "hypo") |> ungroup() |> dplyr::select(chr, dmr_start, dmr_end, type) |> dplyr::rename(start = dmr_start, end = dmr_end) |> data.table(), collapsed_dmrs[collapsed_dmrs$direction == "hypo"]$type)
heatMap(obj,
track = "cg_type_tracks",
regions = "chr14_99317000_99324000",
extraTracks = rev(dmr_tracks),
extraTrackColors = rev(pal))
Further interpretation of the results can be explored using a wide variety of packages available on R. In this example, we will use the topGO package to test for Gene Ontology (GO) term enrichments for genes with hypomethylated regions in the T cell group. First, extract the background list (all genes), then extract the query list from the DMR table.
Note: topGO is not a dependency of Amethyst. You may have to install this package separately.
# install if needed
BiocManager::install("topGO")
BiocManager::install("org.Hs.eg.db")
# load library
library(topGO)
# extract background and query gene lists
background <- rownames(obj@genomeMatrices[["cg_promoters"]]) # all genes tested
query <- unlist(strsplit(collapsed_dmrs$gene_names[collapsed_dmrs$type == "T" & collapsed_dmrs$direction == "hypo"], ", "))
# run topGO analysis
GOdata <- new("topGOdata",
description = "GO Enrichment Analysis",
ontology = "BP",
allGenes = setNames(factor(as.integer(background %in% query), levels = c(0, 1)), background),
geneSel = function(x) x == 1,
nodeSize = 10,
annot = annFUN.org,
mapping = "org.Hs.eg.db",
ID = "symbol")
resultElim <- runTest(GOdata, algorithm = "elim", statistic = "fisher")
resultElim <- GenTable(GOdata, Fisher = resultElim, topNodes = 500, numChar = 60)
# filter results
resultElim <- resultElim |> dplyr::filter(Fisher < 0.01 & Significant > 5) |> dplyr::mutate(fold_change = Significant/Expected, Fisher = as.numeric(Fisher))
resultElim <- janitor::clean_names(resultElim)
Use ggplot to note top results. Here we will just plot GO terms according to fold change (bar width) and adjusted p value (color) in a bar chart. As expected, hits are strongly related to T cell processes.
ggplot(resultElim, aes(x = fold_change, y = reorder(term, fold_change), fill = fisher)) + geom_col() + theme_classic() + scale_fill_viridis_c(direction = -1)
Finally, save your workspace so you don’t have to re-calculate everything in the future.
save.image("~/Downloads/pbmc_vignette_workspace.RData")
Thanks for trying out Amethyst! Additional utilities are still to come. We are also open to suggestions. If you use this package and/or code on our Github for your work, please cite our manuscript.
Good luck in your analysis!